-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Cancel Job Attachments session action when transfer rates drop below threshold #143
Conversation
4b46f84
to
51eeb36
Compare
action_status=ActionStatus( | ||
state=ActionState.FAILED, | ||
fail_message=( | ||
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this rate in based on the LOW_TRANSFER_RATE_THRESHOLD
value? If that constant is changed, this message will be inaccurate, and that could be confusing to diagnose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I have addressed it in the revision.
state=ActionState.FAILED, | ||
fail_message=( | ||
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). " | ||
"The transfer rate was below the threshold for the last 1 minute." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same applies to the "1 minute" here, if the constant defining that is changed, this could be confusing.
failure_reason=( | ||
"Insufficient download speed: " | ||
"Input syncing failed due to successive low transfer rates (< 10 Kb/s). " | ||
"The transfer rate was below the threshold for the last 1 minute." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about the message to constant consistency here, maybe it should be taking the message from the ActionStatus.
32aa9a2
to
9eafde2
Compare
pyproject.toml
Outdated
"E501", | ||
"E722", | ||
"F811", | ||
] | ||
line-length = 100 | ||
|
||
[tool.ruff.isort] | ||
[tool.ruff.lint.isort] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated these option names, as I was getting the below warnings when running hatch run fmt
or hatch run lint
warning: The top-level linter settings are deprecated in favour of their counterparts in the `lint` section. Please update the following options in `pyproject.toml`:
- 'ignore' -> 'lint.ignore'
- 'isort' -> 'lint.isort'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great to see some extra safeguards being added! 🚢 🦈
9eafde2
to
bdde066
Compare
…below threshold Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
bdde066
to
ca77fa0
Compare
…below threshold (#143) Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com> Signed-off-by: Graeme McHale <gmchale@amazon.com>
Problem: The PosixInstanceWorker and WindowsInstanceWorker classes implement a CMF Worker based on an ec2 host. In both cases they start with an empty AMI image (AL2023 or Win Server 2022) then set up users, install the agent, etc. Tests that need to use an AMI that already has users and the agent installed end up having to hack around a lot of assumptions in these classes, or reimplement everything for an ec2-based worker from scratch. Solution: Refactor the PosixInstanceWorker and WindowsInstanceWorker to split them into os-specific base classes that contain all of the generic worker implementation stuff (starting the worker, getting its id, etc), and a derived class that implements the user-setup, agent install, and the like. This makes space for tests that use an AMI with the users & agent pre-setup to derive from the os-specific base classes to get the generic ec2-worker test functionality. BREAKING CHANGE: PosixInstanceWorker has been renamed to PosixInstanceBuildWorker, and WindowsInstanceWorker has been renamed to WindowsInstanceBuildWorker. Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
What was the problem/requirement? (What/Why)
During the SYNC_INPUT_JOB_ATTACHMENTS session actions, there might be a potential for the download operation to get stuck or stall due to unexpected issues. Currently, there is no mechanism to monitor or detect such scenarios. This poses a risk, especially when downloading many/large files or using slow/unstable network.
What was the solution? (How)
on_downloading_files
callback function, which is passed to the AssetSync'ssync_inputs
function to track the the file transfer progress. This callback is intended to be called at regular intervals - 1 second at the longest.< LOW_TRANSFER_RATE_THRESHOLD
), and if the count exceeds the specified threshold (>= LOW_TRANSFER_COUNT_THRESHOLD
), cancels the download and fails the current (SYNC_INPUT_JOB_ATTACHMENTS) action.What is the impact of this change?
Worker agent can now detect and halt potentially stuck SYNC_INPUT_JOB_ATTACHMENTS session actions. This ensures that resources are not wasted on prolonged stalled operations, and that users are timely notified about the issues.
How was this change tested?
hatch run lint && hatch run test
Run an end-to-end test (submitting a non-rendering job bundle --> running a CMF)
Also, confirmed that a telemetry data capturing this sync-inputs failure was sent correctly.
Was this change documented?
No.
Is this a breaking change?
No.